Multiple Sequence Alignment Based on Set 

Covers 

Alexandre H. L. Porto 
Valmir C. Barbosa* 

Universidade Federal do Rio de Janeiro 
Programa de Engenharia de Sistemas e Computagao, COPPE 

Caixa Postal 68511 
21941-972 Rio de Janeiro - RJ, Brazil 

December 10, 2004 



Abstract 

We introduce a new heuristic for the multiple alignment of a set of 
sequences. The heuristic is based on a set cover of the residue alpha- 
bet of the sequences, and also on the determination of a significant set 
of blocks comprising subsequences of the sequences to be aligned. These 
blocks are obtained with the aid of a new data structure, called a suffix-set 
tree, which is constructed from the input sequences with the guidance of 
the residue-alphabet set cover and generalizes the well-known suffix tree 
of the sequence set. We provide performance results on selected BAl- 
iBASE amino-acid sequences and compare them with those yielded by 
some prominent approaches. 
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1 Introduction 

The multiple alignment of biological sequences is one of the most important 
problems in computational molecular biology, with applications in many differ- 
ent important domains, such as the analysis of protein structure and function, 
the detection of conserved patterns and domain organization of a protein fam- 
ily, evolutionary studies based on phylogenetic analysis, and database searching 
for new members of a protein family. For recent reviews highlighting the im- 
portance of multiple alignments in molecular biology, we refer the reader to 

[iniiniEniini 

'Corresponding author (valmirOcos .uf rj .br). 
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The problem of multiple sequence alignment can be stated in the following 
manner. Let si, . . . ,Sk, with fc > 2, be sequences of lengths ni, . . . , Uk, respec- 
tively, over a residue alphabet R. An alignment A of these sequences is a A; x Z 
matrix such that ^[i, j], for 1 < i < k and 1 < j < I, is either a character in R 
or the special character that we call a gap character. In A, fixing i and varying 
j from 1 through I must reproduce Si exactly if gap characters are skipped. 
Fixing j, in turn, yields k characters that are said to be aligned in A, of which 
at least one must be in R. Note, then, that max{ni, . . . ,nk} < I < ni + - ■ ■ + nk- 

The goal of the multiple sequence alignment problem |45[ IM| is to determine 
the most biologically significant alignment of si, . . . , Sk- Finding this alignment 
requires an objective function to associate a score with each possible alignment, 
and in this case the multiple sequence alignment problem is to find an alignment, 
known as the optimal alignment, that maximizes the objective function. There 
exist many different objective functions that can be used |14| . but none of them 
guarantees that the corresponding optimal alignment is the most biologically 
significant alignment of the sequences |39) . 

It follows from the definition of an alignment that the number of different 
alignments of a given set of sequences is exponentially large in fact, the 

multiple sequence alignment problem is known to be NP-hard |62l 1^ [7T1 129| . 
Feasible approaches to solve the problem are then all of a heuristic nature, as 
can be seen in [TClB51[Tni[nil^l^l55ll5^[TT]. 

In this paper we describe a new heuristic that is based on set covers of 
the residue alphabet R. Such a set cover is a collection of subsets of R whose 
union yields R precisely. The idea behind the use of a set cover is that each 
subset can be made to contain all the residues from R that possess certain 
structural or physicochemical properties in common [^Sl 1^ ED]- The most 
familiar scenario for the use of set covers is the case of i? as a set of amino 
acids, so henceforth we concentrate mainly on multiple protein alignments, even 
though it makes perfect sense for i? to be a set of nucleotides as well. Set covers 
of an amino-acid alphabet have been studied extensively, as for example in 

|Snilini|321llHll2HllSllllP 52 6^[TniiniE7|. 

Set covers lie at the heart of the new heuristic. In essence, what they do is 
to allow the introduction of a new data structure, called a suffix-set tree, that 
generalizes the well-known suffix tree of a set of sequences and can be used in 
the determination of subsequence blocks that ultimately give rise to the desired 
alignment. In general terms, this is the same approach as some others in the 
literature [SH ESI El EH EH] , but our use of set covers as its basis provides 
a fundamentally more direct link between relevant properties shared by groups 
of residues and the resulting alignment. 

The following is how the remainder of the paper is organized. In Section|21we 
introduce our new data structure and in Section El describe our new approach 
to multiple sequence alignment. Then we proceed in Section^ to an experi- 
mental study of the new method as compared to some of its most prominent 
competitors, and finalize with conclusions in Section El 
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2 Suffix-set trees 



In this section we describe our new data structure. It is a generalization of the 
well-known suffix tree of a set of sequences |65l 116) . which is one of the most 
important data structures in the field of pattern recognition. Such a suffix tree 
has 0{ni + • • • + nk) nodes and can be constructed in 0{ni + • • • + nk) time by 
several algorithms 1^ ICT [HI . 

Suffix trees can be applied to many problems, but their principal application 
in computational molecular biology is to assist the algorithms that try to ob- 
tain blocks comprising biologically significant subsequences of a set of sequences, 
known as the motifs of that set (8j. These motifs, in the case of proteins, encode 
structural or functional similarities; in the case of nucleic acids, they encode 
mainly the promoter regions. There exist several algorithms for extracting mo- 
tifs from a set of sequences, based on a multitude of different algorithmic and 
mathematical methodologies. We refer the reader to @1 021 in the general 
case, and to [55] in the case of nucleic acids. 

2.1 Definition and properties 

Let C = {Ci, . . . , Cp} be a set cover of R, and let Eg = • • ■ ? Q^p, ct$} be a 
new alphabet having a character ai for each Ci € C and a further character 
a$ possessing a function similar to the terminator used in suffix trees. Like the 
suffix tree, our new data structure is also a rooted tree; it has edges labeled 
by sequences of characters from Ec and nodes labeled by indices into some of 
si, . . . , Sfc to mark suffix beginnings. We call it a suffix-set tree and it has the 
following properties: 

• The first character of the label on an edge connecting a node to one of its 
children is a different character of Ec for each child. 

• Each nonempty suffix of every one of the k sequences is associated with at 
least one leaf of the tree; conversely, each leaf of the tree is associated with 
at least one nonempty suffix of some sequence (if more that one, then all 
suffixes associated with the leaf have the same length).^ Thus, each leaf 
is labeled by a set like Ji ),..., for some q> 1, where {ia,ja), 
for 1 < a < 9: 1 ^ ^ fc, and 1 < < , indicates that the suffix 
Si„ [ja ■ ■ n-i^ ] of Si^ is associated with the leaf. 

• Let w be a node of the tree. The label of v is the set {(ii, ji), . . . , iiq,jq)} 
that represents the q suffixes collectively associated with the leaves of the 
subtree rooted at w. If ■ ■ ■ ol^^ is the concatenation of the labels on the 
path from the root of the tree to w, excluding if necessary the terminal 
character a$, then ■ • • Q^c^ is a common representation of all prefixes 
of length r of the suffixes associated with the leaves of the subtree rooted 
at V. If \ja ■ ■ ni^] is one of these suffixes, then for I < b < r we have 

^ These properties may alternatively be extended to accommodate the empty suffix as well. 
We refrain from doing so for simplicity's sake only. 
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Si^lia + 6—1] G Cc(, (that is, the 6th character of the suffix is a member 

We discuss an example shortly, but let us first examine a simple procedure 
to construct a suffix-set tree. Before that, though, the reader should note that, 
when C is a partition of R into singletons, the suffix-set tree becomes the familiar 
suffix tree of si,...,s/c. In order to see this, it suffices to identify for each 
character in each sequence the member Ci of C to which it belongs, and then 
substitute ai for that character.^ 

2.2 A simple construction algorithm 

We now describe an algorithm to construct the suffix-set tree of a set of se- 
quences. The algorithm generates a tree T with n-r nodes having the properties 
described above. For each suffix of each sequence, the algorithm first processes 
its first character, then its second character, and so on. We describe it as the 
following three steps: 

1. Create a root r for T and label it with a set comprising every possible pair 
(i, j), i.e., for I <i < k and I < j < rii. Let v = r and c = 1. 

2. Let {(«!, ji), . . . , {iq,jq)} be the label of v. Let Dc, for every C e C, and 
D$ be (initially empty) sets of node-label elements. For each {ia,ja) in 
the label of v, check whether ja + c — 1 < rii^. In the affirmative case 
(i.e., the suffix Si„[ja ■ -f^ia] least c characters), add (ia,ja) to each 
Dc such that Si^[ja + c — 1] e C; in the negative case, add {ia,ja) to D$. 
With C = {C € C \ Dc %} initially, while there exist C, C G C such 
that Dc C Dc, do C = C \ {C}. 

3. If C = 0, I?$ 7^ 0, and v ^ r, then append oi% to the label of the edge 
connecting v to its parent. If C = {Ci}, = 0, and v ^r, then append 
ai to the label of the edge connecting v to its parent and go to Step|21with 
c -|- 1 in place of c. Otherwise, create a new leaf i and the edge e between 
V and £ if £)$ ^ 0, then label £ with £)$ and e with the character a$. Also, 
for each Ci G C, create a new node Vd and the edge Cd between v and 
vci, then label Vd with Dd, Cd with a^, and go to Step El with v — Vd 
and c + 1 in place of c. 

Notice that, in StepEl we eliminate from C every C for which C" G C exists 
such that Dc C Dc' ■ This implies that sets in C that are subsets of other sets 
also in C become useless. Whenever this is for some reason undesirable, the 
pruning of C may be replaced by a simple elimination of duplicates (yielding, 
necessarily, a larger suffix-set tree). 

^We include edge labels in our definition of the suffix-set tree only on account of this 
aspect of it as a generalizer of the suffix tree. Edge labels are nowhere strictly necessary in 
this paper's application of the suffix-set tree, but there may exist other applications that make 
use of them, just like the case with suffix trees. 
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We show in Figure the suffix-set tree obtained by this latter ahernative 
for R = {A,C,G,T}, Ci = {A,G,T}, C2 = {T,C}, si = AGCTAG, and 
52 = GGGATCGA. Had Steps been used as presented, the result would 
have been essentially the same, except for the inexistence of nodes v and U2 
(and all adjoining edges) and the addition of an edge between w and ui. 

If we recall that p is the number of sets in C, then it is relatively simple to 
see that Steps ^3 run in O {jit {p{ni + • • • + Uk) +p^|-Rp)) time and require 
O {riTini + • • • + Tifc)) space. In these figures, the time complexity is essentially 
dominated by Step|2 the space complexity by the need to store node labels. 
What is worrisome in both cases is the linear dependence on the number of 
nodes of the suffix-set tree, tt-t, since it is conceivable that tit may itself depend 
exponentially on maxjrii, . . . , Tifc} for some sequence sets and set covers in the 
worst case. 

Let us then probe a little more deeply into the nature of tit- One first 
important observation is that, notwithstanding such a worst-case possibility, we 
have found by means of experiments that in many biologically relevant cases the 
expected value of ut grows polynomially, not exponentially, with ni -I- • • • -f rifc. 
The results of these experiments are shown in the plots of Figure[21 in which part 
(a) refers to the set cover from , here denoted by I, and part (b) refers to the 
set cover from ^Sj, here denoted by 5. In both cases, R is the set of amino acids; 
the set covers are given in Tabled In the figure, solid plots refer to StepsmSas 
presented and give the average value (by itself and plus or minus the standard 
deviation) of tit over single-sequence suffix-set trees for 10^ randomly generated 
sequences.'^ Dashed plots, correspondingly, refer to the alternative version of 
the tree-constructing algorithm. In all cases, it is clear from the plots' linearity 
with respect to the two logarithmic scales that, on average, tit depends only 
polynomially on the sequence's length. Not only this, but it also seems true 
that deviations above the average occur only in modest amounts. 

The second important observation is related to our preferred use of the 
tree-constructing algorithm in Section 13 In the best-performing options of 
the strategy to be described in that section (cf. Section 0| as well) , we do not 
construct the suffix-set tree to completion, but rather only the portion of the 
tree that is needed to represent all suffix prefixes of length at most M (a fixed 
parameter). Achieving this requires a simple modification to StepHJ we add the 
pair (ia,ja) to the Dc sets if ja + c— 1 < rn^ and in addition c < M . Clearly, we 
now have that ut is 0{p^^), therefore polynomial in p given the fixed parameter 
M. 

^Is is curious to notice, in Figure |2jb), that the three soUd plots practically overlap one 
another. This happens because the set pruning performed at the end of Step |21 turns the set 
cover S into a partition of R. Not every subset of _R in this partition is a singleton, so the 
suffix-set tree that is constructed is not the suffix tree of the sequence on R. Nevertheless, the 
suffix-set tree can be regarded as the suffix tree of the single sequence on E5 that corresponds 
by trivial substitution to the sequence on R. 
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Figure 1: An example suffix-set tree, node labels shown only for the leaves. 

Each node label is expressed more compactly than in the text; for example, 
"12,114" stands for the label {(1, 2), (2, 4)}. 
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Figure 2: Expected behavior of Ht for single-sequence sufSx-set trees under the 
T (a) or <S (b) set cover. 
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Table 1: Amino-acid set covers. 



Cover set 


I 


S 


Ci 


{M,I,L,V} 


{P} 


C2 


{MJ,L,V,A,P} 


{A,G} 


C3 


{MJ,L,V,F,W} 


{D.E} 


Ca 


{MJ,L,V,A,P,F,W} 


{N,Q} 


Cs 


{D,E,H,R,K} 




Ce 


{S,T,Q,N} 


{F,W,Y} 


C7 


{S,T,Q,N,D,E} 


{H,K,R} 


Cg 


{Q, N, D, E, H, R, K) 


{I-L,V} 


C9 


{S,T,Q,N,D,E,H,R,K} 


{C,FJ,L,M,V,W,Y} 


Cio 


{Q,N} 


{D,E,H, K, N,Q,R,S, T} 


Cii 


{D,E,Q,N} 




C12 


{H, R,K} 




Cl3 


{R,K} 




Cl4 


{F,W,Y} 




Cl5 


{G,iV} 




C16 


{A,C,G,S} 




Ci7 






C18 


{D,E} 





3 Alignments from set covers 

For 2 < fc' < fc, let ^1, . . . , tk' be subsequences of si, . . . , such that each ta 
is a subsequence of a different s;^, with 1 < a < k' and 1 < ia < k. We call 
{ti, . . . ,tk'} a block. If A' is an alignment of ^i, . . . , t^,' having I' columns, then 
the score of A', denoted by S{A'), is a function, to be introduced shortly, of 

/' fc'-l k' 

r(-4') = EE E Q{pl,Pl,A'[a,c],A'[b,c]), (1) 

c— 1 a—1 b—a+1 

where is the position in Si^ of the rightmost character of ta whose column 
in A' is no greater than c (pj^ — 0, if none exists), and similarly for pg. In 
Oj Q (Pa)Pb5-4'[a, c], ^'[6, c]) is the contribution to T{A') of aligning the two 
characters A'[a, c] and A'[b, c] given and pg. If both y^'[a, c] and .4'[5, c] are 
gap characters, then we let Q {p'^,pl,A'[a,c],A'[b,c\) = 0. Otherwise, the value 
of Q {Pa,Pi„A'[a, c],A'[b, c]) is determined through a sequence of two conceptual 
steps. 

The first step is the determination of the combined number of optimal global 
and local pairwise alignments of s;^ and that go through the Pa,Ph cell of 
the dynamic-programming matrix. We dwell very briefly on how this number 
can be found, mentioning only that, even though it can be exponentially large 
IHIE], a simple extension of the procedure of in the global case or of 0^1 
in the local case suffices to compute it efficiently. In what follows, we split this 
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number into three others, and let each of Ui{p'^,pl), . . . ,U3{p^,pl) be either 
a hnearly normalized version of the corresponding number within the interval 
[1, L] for L a parameter, if that number is strictly positive, or 0, otherwise. We 
use f/i(p^,pg) in reference to the case of optimal alignments through the Pa,Pl 
cell that align Si„[Pa] to Sijp^]' ^'^iPa^Pb) ahgnments of Si„[Pa] to ^ S^P 
character, and t/3(Pa:Pb) alignments of Si5[p^] to a gap character. 

The second step is the determination of Q {p^,pl,A'[a, c],A'[b, c]) itself from 
the quantities Ui{pl,pl), • ■ • . MpI.pD- « Ui{pl,pl) = ' ' ' = C^3b^,pg) = (no 
optimal alignments through thepj^,pg cell), then we let Q [PaiPti A' [a, c], ^'[6, c]) = 
—L. Otherwise, we have the following cases to consider, where z G {1, 2, 3} se- 
lects among Ui, U2, and J/s depending, as explained above, on ^'[a,c] and 
A'[h,c]: 

. If U,{pl,pl) > 0, then we let Q ^'[a, c], ^'[6, c]) = C/.(p^,pg). 

• If UziPaiPb) = 0, then we let 

Q [pl , , c] , ^' [6, c] ) - -min+ { U, {pi ,pl),U2ipl,pt), C/3 (pI^pI)} , 

where we use min"*" to denote the minimum of the strictly positive argu- 
ments only. 

What this second step is doing is to favor the alignment of A' [a, c] to A'[b, c] in 
proportion to its popularity in optimal pairwise alignments of Si^ and Si^, and 
similarly to penalize it — heavily when cell Pa^Pb is part of no optimal pairwise 
alignment, less so if it is but not aligning A'[a, c] to A'[b, c\. 

Finally, the function that yields S{A') from T{A') is designed to differentiate 
two alignments of different blocks for which T might yield the same value. We do 
so by subtracting off T{A') the fraction of |T(y^')| obtained from the average of 
two numbers in [0, 1]. The first number is l~k' /k and seeks to privilege (decrease 
T by a smaller value) the block with the greater number of subsequences. The 
second number is a function of the so-called identity score of an alignment, 
that is, the fraction of the number of aligned residue pairs that corresponds 
to identical residues. If we denote the identity score of A' by I{A'), then the 
second number is 1 — I{A') and aims at privileging alignments whose identity 
scores are comparatively greater. We then have 

S{X) = T{A!) - (2 - ^ - /(-A')) (2) 

The remainder of this section is devoted to describing our heuristic to obtain 
a. k X I alignment A of the sequences si, . . . ,Sk, given the set cover C of the 
residue alphabet R. The sufRx-set tree T of si, . . . , Sfe plays a central role, and 
we assume it only represents suffix prefixes whose lengths do not exceed M, the 
parameter introduced in Section |21 We first describe how to obtain a suitable 
set B of blocks from si, . . . , Sk, and then how to obtain A from B. 
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3.1 Blocks from set covers 



We start by creating a set B of blocks that is initialized to and is augmented 
by the inclusion of new blocks as they are generated. The size of B is at all 
times bounded by yet another parameter, denoted by N. 

Every node f of T that is not the root may contribute blocks to B. If n„ < fc 
is the number of distinct sequences with suffixes associated with v and 1^ < M 
is the common prefix length of all those suffixes, then each block contributed 
by V is formed exclusively by some of the length-/^ prefixes as its subsequences, 
totaling at least two and including at most one from each of the n„ sequences. 

Let Ab denote an alignment of block i?'s subsequences. Block formation for 
node V proceeds as follows. First the sequences are sorted in nonincreasing 
order of their numbers of suffixes associated with v and a new block is created 
for each prefix of the first-ranking sequence. An attempt is then made to add 
more prefixes to each such block B by visiting the remaining n„ — 1 sequences 
in order and selecting for each one the prefix, if any, that increases S{Ab) the 
most when As acquires another row by the addition of that prefix. It is worthy 
of mention that, as prefixes coalesce into the final form of B, Ab never contains 
any gap characters, in which case Q is seen to revert to a simpler form. But 
notice that the functional form in Q continues to be effective, not only because 
of the identity scores, but also because the expansion of B involves comparing 
alignments that may differ in numbers of rows (a unit difference, to be specific).^ 
At the end, all blocks still having one single sequence are discarded. 

Once u's contribution to B is available, it is sorted and then merged into 
B (we can think of B as being internally organized as a sorted list). Both 
operations seek to retain inside B those blocks whose alignments' scores are 
greater. If needed, additional tie-breaking criteria are employed, including those 
that are already reflected in jSJ: greater numbers of subsequences and identity 
scores are preferred. 

Now say that two blocks B and B' are such that B is contained in B' if every 
subsequence of B is itself a subsequence of the corresponding subsequence of B'. 
Once every non-root node of T has been considered for contributing to B, the 
resulting B is further pruned by the removal of every block that is contained in 
another of at least the same score. After this, B undergoes another fix, which 
is to extend its blocks by the addition of new subsequences so that at the end 
they all contain exactly one subsequence from each of si, . . . , s^. 

The following is how we achieve this extension for block B Q B. We consider 
the unrepresented sequences one by one in nonincreasing order of their lengths. 
Then we use a variation of a semi-global algorithm for aligning a sequence 
to a subsequence of another sequence fHl to ahgn the current Ab to a. 
subsequence of the unrepresented sequence under consideration. This variation 
is straightforward and employs the Q function in place of a substitution matrix 

*In a similar vein, comparing candidate now rows based solely on their effect on S{Ab) 
may at times be insufficient. In our current implementation we use the simpler T{Ab) as well 
if needed to help break ties. 
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and gap costs. As B is thus extended, its alignment Ab changes as well, and 
may now acquire gap characters for the first time. 

The final step in this setup process of B is to once again examine all its blocks 
and remove every block B such that either S{Ab) < or Ab is contained in 
Ab' for some other B' G B for which S{Ab) < S{Ab')-^ The containment 
of one alignment in another is a notion completely analogous to that of block 
containment introduced above. 

3.2 Alignments from blocks 

The B that we now have contains /c-subsequence blocks exclusively, all having 
nonnegative-score alignments that are not contained in one another (except 
when the container has a lower score). In this new phase of the heuristic we 
build a weighted acyclic directed graph D from B. Manipulating this graph 
appropriately eventually yields the desired alignment A of si, . . . , Sk- 

The node set of Z? is S U {s,t}, where s and t are two special nodes. In D, 
an edge exists directed from s to each B G B, and also from each B ^ B to 
t. No other edges are incident to s or t, which are then a source and a sink in 
D, respectively (i.e., edges only outgo from s or income to t). The additional 
edges of D are deployed among the members of B in the following manner. For 
B, B' E B, an edge exists directed from B to B' if every subsequence of B starts 
to the left of the corresponding subsequence in B' in the appropriate sequence of 
Si, . . . , Sfe. In addition, if B and B' overlap, then Ab and Ab' are also required 
to be identical in all the overlapping columns. Edges deployed in this manner 
lead, clearly, to an acyclic directed graph. 

In D, both edges and nodes have weights. Edge weights depend on how the 
blocks intersect the sequences si, . . . ,Sk- Specifically, if an edge exists from B 
to B' and the two blocks are nonoverlapping, then its weight is —x, where x 
is the standard deviation of the intervening sequence-segment lengths. Edges 
outgoing from s or incoming to t are weighted in the trivially analogous manner. 

Weights for edges between overlapping blocks and node weights are com- 
puted similarly to each other (except for s and t, whose weights are equal to 
0). If a; is the number of residues in node B, then its weight is x/^/k. In the 
case of an edge between the overlapping B and B' , we let x be the number of 
common residues and set the edge's weight to —x/\/lt. We remark, finally, that 
this weight-assignment methodology is very similar to the one in |68) , the main 
difference being that we count residues instead of alignment sizes. 

Having built D, we are then two further steps away from the final alignment 
A. The first step is to find an s-to-t directed path in D whose weighted length 
is greatest. Since D is acyclic, this can be achieved efficiently 0. Every block 
B appearing on this optimal path immediately contributes Ab as part of A, 
but there still remain unaligned sequence segments. 

^One technicality here is that, since all blocks have at this point the same number k of 
subsequences, j2j no longer holds as given but rather as the simpler form in which the 1 — fc'/fc 
term is done away with and furthermore the division by 2 is likewise not performed. 
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The second step and final action of the heuristic is then to complete the 
missing positions of A. We describe what is done between nonoverlapping suc- 
cessive blocks, but clearly the same has to be applied to the left of the first block 
on the optimal path and to the right of the last block. Let B and B' be nonover- 
lapping blocks appearing in succession on the optimal path. Let ti, . . . ,tfe be 
the intervening subsequences of si, . . . , that are still unaligned. We select 
the largest of ti, . . . ,tk and use it to initialize a new alignment along with as 
many gap characters as needed for every one of ti,. . . ,tk that is empty. We 
then visit each of the remaining subsequences in nonincreasing length order and 
align it to the current, partially built new alignment. The method used here is 
totally analogous to the one used in Section 13.11 for providing every block with 
exactly k subsequences, the only difference being that a global (as opposed to 
semi-global) procedure is used. 

4 Computational results 

We have conducted extensive experimentation in order to evaluate the per- 
formance of the heuristic of Section |2| Our strategy has been to employ the 
BAliBASE suite [2^10] as the source of sequence sets, and to seek comparative 
results vis-a-vis some prominent approaches, namely CLUSTAL W [SIj PRRN 
P) . DIALIGN ISlOni, T-COFFEE gD], and MAFFT 23,. Some of these have 
already been the subject of comparative studies as a group |S7], while some 
others constitute more recent proposals. 

The BAliBASE suite comprises 167 families of amino-acid sequences divided 
into eight reference sets, each of which especially constructed to emphasize some 
of the most common scenarios related to multiple sequence alignment. The suite 
contains a reference alignment for each of its families, in most cases along with 
motif annotations given the reference alignment. We have concentrated our 
experiments on the families for which such annotations are available, namely 
the first five reference sets p. 

The metrics we use to evaluate a certain alignment A arc motif-constrained 
versions of those originally introduced along with the BAliBASE suite |57|:^ 
the sum-of- pairs score (SPS), denoted by SPS{A)^ and the column score (CS), 
denoted by CS{A). If we let pabc be a 0-1 variable such that Pabc = 1 if and 
only if the residues A[a, c] and A\b, c] share the same column in the BAliBASE 
reference alignment for the same sequences, then the version of SPS{A) we use 
is the average value of pabc over the residue pairs that are annotated as belonging 
to motifs in the reference alignment. Similarly, if Cc is the 0-1 variable having 
value 1 if and only if all the residues in column c of ^ also share a column in 
the reference alignment, then we use CS{A) as the average value of Cc over the 
columns of motifs in the reference alignment. 

All the experiments with the heuristic of Section were carried out with 
M — 1, . . . , 4, L = 20, and N = 200. The set covers we used were the I and S 

^We do point out, however, that alternative methodologies have also been used 1221 1251 . 
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Table 2: Gap initialization and extension costs. Sources: ^T) for BL0SUM62 and 
PAM250; [TK) for VTML160. 





Gap costs 


Substitution matrix 


Global alignments Local alignments 




Initialization Extension Initialization Extension 


BL0SUM62 


7.5 0.9 8.0 0.5 


PAM250 


11.0 0.5 6.0 1.3 


VTML160 


14.0 2.0 14.0 2.0 



set covers of Table Gap costs were as given in Table El for each of BL0SUM62 
[Tt] . PAM250 0, and VTML160 ^ as the substitution matrix. 

We divide our presentation of results into two groups. The first group is given 
in Figures OHSl respectively for the BL0SUM62, PAM250, and VTML160 matrices, 
and also in Figure (see below) . In each of Figures |2Hil part (a) refers to the 
T set cover, part (b) to the S set cover. Also, figure legends are given as "x_y," 
where "x" is either the value of M or else the word "all," in this case indicating 
that the full suffix-set tree was constructed, and "y" is either "c" or "nc" (for 
"compact" and "non-compact"), indicating respectively whether the tree was 
constructed according to Steps ^[31 of Section [S] or by the alternative procedure 
indicated in that section (results that would be labeled "all_nc" are altogether 
absent, since in the case of the I set cover the full tree as constructed by the 
alternative procedure is large to the extent of rendering the entire approach 
impractical) . 

What we show in each figure are average SPS and CS values inside each of 
the BAliBASE reference sets we considered. Clearly, the choices labeled "2_c" 
emerge from Figures OHHl as the best alternatives nearly always. When we group 
the results of these choices together for the various combinations of substitution 
matrix and set cover, we obtain what is depicted in Figure Clearly, the 
VTML160-5 pair appears as a first choice most often, with the only noteworthy 
exception of Reference Set 4, in which case it seems best to use the pair PAM250- 
I. 

Still with regard to Figures El-El it is worth noting that the cases labeled 
"all" or "nc" are consistently among the worst possibilities. Referring back 
to our discussion in Section El we see that this trend favors the use of the 
techniques that produce the smaller suffix-set trees (viz. prefix-length bounding 
and Steps^Elas presented in that section). What this means is that, in addition 
to the generally superior scores, we are also to expect a more efficient use of 
computational resources as the suffix-set tree is constructed. 

Our second group of results refers to comparing the heuristic of Section El to 
the five competing approaches mentioned earlier. The approaches that predate 
the introduction of the BAliBASE suite were run with default parameters (this 
is the case of CLUSTAL W, DIALIGN, and PRRN), while the others, having 
appeared with experimental results on the BAliBASE suite when first published, 
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Figure 3: Average scores for the heuristic of Section with the BL0SUM62 sub- 
stitution matrix and the X (a) or S (b) set cover. 
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Figure 4: Average scores for the heuristic of Section O with the PAM250 substi- 
tution matrix and the X (a) or S (b) set cover. 
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Figure 5: Average scores for the heuristic of Section |31 with the VTML160 substi- 
tution matrix and the T (a) or S (b) set cover. 
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Figure 6: Average scores for the "2_c" cases of Figures EHSl 



were run with their best parameter choices (this apphes to T-COFFEE and to 
MAFFT). As for our heuristic, the results we use in the comparison are those 
singled out above as the champions of Figure IHI 

Comparative results are given in Figure[7| where the same style of Figures|31- 
|H|is used, and in Table|2| where total running times are given. ^ It is clear from 
Figure that no absolute best can be identified throughout all the reference 
sets. As we examine the reference sets individually, though, we see that at 
least one of the two substitution-matrix, set-cover pairs used with our heuristic 
is in general competitive with the best contender. Noteworthy situations are 
the superior performance of our heuristic on Reference Set 5, and also its weak 
performance on Reference Set 3. As for the running times given in Table our 
current implementation is seen to perform competitively as well when compared 
to the others. 



5 Concluding remarks 

We have in this paper introduced a new heuristic for the multiple alignment 
of a set of sequences. The new heuristic is based on a set cover of the residue 
alphabet of the sequences, and also on finding significant subsequence blocks to 
be combined and eventually yield the desired alignment. Central to our heuristic 
is the notion of a sufHx-set tree, a new data structure that generalizes the well- 
known suffix tree of a set of sequences. Even though this tree can conceivably 
have a number of nodes that is exponential in the largest of the sequence lengths, 

^AU running times were obtained on an Intel Pentium 4 processor running at 1.8 GHz with 
1 Gbytes of main memory. 
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Figure 7: Average scores for the two best cases of FigureEland the five competing 
approaches. 



Table 3: Running times for the two best cases of Figure|Sland the five competing 
approaches. 



Approach Time (seconds) 



PAM250-X 


2058.76 


VTML 160-5 


1843.27 


CLUSTAL W 


111.30 


DIALIGN 


338.05 


MAFFT 


95.59 


PRRN 


2269.84 


T-COFFEE 


4922.50 
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we have demonstrated by means of experiments on the BAhBASE suite that 
best results are achieved when the tree is only partiaUy constructed, in which 
case its number of nodes becomes polynomial in the size of the set cover. 

Our experiments on the BAliBASE suite indicate that the new heuristic is 
competitive when compared to a selection of existing strategies. This is true 
both in terms of motif-constrained versions of the BAliBASE standard scores 
and also in terms of running times for completion. As we focus in a more 
detailed fashion on individual performance results, our heuristic occasionally 
outperforms all others, and equally occasionally falls behind them. We find it, 
then, to be a relevant approach, one that sheds new light on the usefulness of 
residue-alphabet set covers. 

Many of our heuristic's details can still undergo improvements that go be- 
yond the mere search for a more efficient implementation. One possibility is 
clearly the use of potentially better pairwise alignments, both global and local, 
when they are needed as described in Sectional This possibility is already ex- 
ploited by T-COFFEE, which not only employs a position-specific score matrix, 
but also (as in the implementation used for our comparisons) uses CLUSTAL 
W to obtain global pairwise alignments and LALIGN ^Sj to obtain a set of 
highly significant local pairwise alignments. We also see improvement possibil- 
ities in the block- and alignment-extension methods described at the ends of 
Sections l3.ll and IX^ respectively. In these two occasions, sequences or subse- 
quences are considered in nonincreasing order of their lengths, which of course is 
an approach simple to the point of in no way taking into account the biological 
significance of the sequences or subsequences. 

It is also apparent from our presentation of the heuristic in Section 01 that 
several options exist for many of its building parts. This refers not only to 
choosing parameter values but also to selecting auxiliary algorithms at several 
points. Whether better choices exist in terms of yielding even more significant 
alignments, and doing it perhaps faster as well, remains to be verified. 
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